Skip to content

cluster(lb): fix load balancer stuck degraded during bootstrap#48

Open
mweibel wants to merge 1 commit into
mainfrom
fix-loadbalancer-readyness
Open

cluster(lb): fix load balancer stuck degraded during bootstrap#48
mweibel wants to merge 1 commit into
mainfrom
fix-loadbalancer-readyness

Conversation

@mweibel

@mweibel mweibel commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

During initial cluster bootstrap the load balancer could get stuck in degraded or error and never recover on its own. Two things combined to cause this:

  • The health monitor was created before any control-plane node existed. With a monitor but no healthy members, the LB reports degraded/error.
  • The reconciler never requeued a non-running LB, so it never observed the LB recovering once the control-plane nodes came up. This was especially the case if it took a bit of time until control plane nodes were available.

Changes:

  1. Requeue every 10s while the LB is pending (degraded/error)
  2. Defer creating the health monitor until the control plane is initialized (at least one control-plane node up).
  3. Move health monitor reconciliation to the end of reconcileNormal, after the load balancer and floating IP, since it depends on both being set.

Also fixes the cloudscaleMachineToCluster watch mapping, which filtered on the wrong label (MachineControlPlaneNameLabel instead of MachineControlPlaneLabel) and so never enqueued on control-plane machine changes.

During initial cluster bootstrap the load balancer could get stuck in
`degraded` or `error` and never recover on its own. Two things combined
to cause this:

- The health monitor was created before any control-plane node existed.
  With a monitor but no healthy members, the LB reports `degraded`/`error`.
- The reconciler never requeued a non-running LB, so it never observed the
  LB recovering once the control-plane nodes came up. This was
  especially the case if it took a bit of time until control plane nodes
  were available.

Changes:

1. Requeue every 10s while the LB is pending (`degraded`/`error`)
2. Defer creating the health monitor until the control plane is
   initialized (at least one control-plane node up).
3. Move health monitor reconciliation to the end of `reconcileNormal`,
   after the load balancer and floating IP, since it depends on both
   being set.

Also fixes the `cloudscaleMachineToCluster` watch mapping, which filtered
on the wrong label (`MachineControlPlaneNameLabel` instead of
`MachineControlPlaneLabel`) and so never enqueued on control-plane
machine changes.
@mweibel mweibel force-pushed the fix-loadbalancer-readyness branch from a737632 to 944c5e1 Compare June 26, 2026 14:52
@mweibel mweibel changed the title cluster(lb): improve initial bootstrap behavior cluster(lb): fix load balancer stuck degraded during bootstrap Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant